Exploring Madrid's housing market

created by Juan Ramón Selva and Gladys Kenyon for the SDSC 2022

https://github.com/gladyskenyon/SDSC_22_workshop

Twitter: gekenyon, Email: g.e.kenyon@liverpool.ac.uk

Introduction

'A picture is worth a thousand words'

Often, complex ideas and relationships can be conveyed more quickly and effectively with a single image than with the written word. In data science, visualising data is a key step of analysis, used to explore its form. Additionally, spatial visualisations and analytics are powerful, insightful tools for understanding geographical data.

The following workbook takes you through a spatial analysis of the 2018 housing market in Madrid. Quantitative analysis and several geovisualisations will be used to present spatial residential patterns. Finally, an unsupervised machine learning algorithm (k-means) allows us to create groups of similar properties (sub-markets) based on their attributes.

The data is provided by Idealista, a major property advertisement company operating in Southern Europe. The data is openly available and licensed in the package idealista18 (https://github.com/paezha/idealista18).

There are several data sources used in the analysis. The data folder contains separate folders for Madrid, Barcelona and Valencia. There is a 'Sale' file for each city, which contains quarterly single-family listings for 2018, provided by idealista. Some of the variables within the sales data are sourced from the Cadastre (the official building register in Spain). Other data includes the neighbourhood polygons built by idealista (idealista_level8), administrative boundaries (city_dist) and building footprints (polygons_inspire_buildings). We will be using the Madrid data; if you have time to spare at the end of the session, or in future, the same data is available on GitHub for Barcelona and Valencia.

Contents:

Getting Started

Section 1: Exploratory (Spatial) Data Analysis

Section 2: Unsupervised Machine learning

Getting started

We will start by loading the required libraries and data. The Word file 'Metadata' in the GitHub repository provides detail on the attributes included in the property listings data sets. Spend some time getting familiar with the property data we have. We will also create some new variables.
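A setup cell along these lines loads the libraries used throughout the workbook. The file names in the comments are assumptions based on the repository layout described above, not the definitive paths — check the data folder of your download before uncommenting them.

```python
import pandas as pd
import geopandas as gpd          # spatial data frames
import seaborn as sns            # statistical graphics
import matplotlib.pyplot as plt

# File names below are assumptions -- check the repository's data
# folder for the actual paths before uncommenting.
# madrid_sale = gpd.read_file("data/Madrid/Madrid_Sale.shp")
# madrid_polygons = gpd.read_file("data/Madrid/idealista_level8.shp")
print(pd.__version__, gpd.__version__)
```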

Madrid Sale data

Python Data Types

| Data Type | Examples |
| --- | --- |
| Integers | -2, -1, 0, 1, 2, 3, 4, 5 |
| Floating-point numbers | -1.25, -1.0, -0.5, 0.0, 0.5, 1.0, 1.25 |
| Strings | 'a', 'aa', 'aaa', 'Hello!', '11 cats' |
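A quick sanity check of these types in the interpreter:

```python
# Python reports the type of each literal from the table above.
print(type(5))         # int
print(type(0.5))       # float
print(type('Hello!'))  # str
```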

Idealista boundaries

Building footprints

Data Manipulation

Run these lines to create new variables that will be needed for the analysis.
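A sketch of the kind of variable creation involved, on a toy table. The column names 'price' and 'area' are assumptions for illustration; the actual names are listed in the metadata file.

```python
import numpy as np
import pandas as pd

# Toy listings table standing in for the Madrid sale data.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "price": rng.integers(100_000, 900_000, 50),
    "area":  rng.integers(40, 200, 50),
})

# Unit price controls for property size (price per square metre).
df["unit_price"] = df["price"] / df["area"]

# Price decile, 1 (cheapest 10%) to 10 (most expensive 10%).
df["p_decile"] = pd.qcut(df["price"], 10, labels=False) + 1
print(df["p_decile"].min(), df["p_decile"].max())
```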

Section 1: Exploratory (Spatial) Data Analysis

In Section 1A we will do the following:

Understanding the trends and patterns of data is a key first step in any analysis. A great resource is the Python Graph Gallery, a collection of hundreds of charts made with Python, including reproducible code. Most of these will be made with the seaborn package. Let's take a look.

Section 1A: Visualisations

Univariate statistics

Univariate analysis is a technique to analyse one variable's range and measures of central tendency (average).

Histograms and Kernel Density Estimation (KDE)

The purpose of a histogram is to understand the distribution of numerical data. The height of each bar represents the number of values in the data set that fall into that bin (equal class intervals of the variable's range). The following code chunk plots a histogram of a variable's frequency. For further information on histograms, check out this resource.

KDE is a data smoothing operation: it estimates a probability density function of a variable. Read more here. A KDE is added to the histogram to show the smoothed distribution. A rug plot is shown at the bottom of the histogram; it shows the distribution of the raw points.

Distribution of unit price

We are particularly interested in the distribution of the price variables. Unit price is a better measure than price as it controls for the size of properties (larger houses tend to be more expensive).

Bar plots

Many of the variables in the dataset are binary, which means they have two categories. These indicate the presence of certain structural attributes (see metadata). We can visualise their frequency using bar charts.
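A sketch with a made-up binary column (the real attribute names follow the metadata file):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import pandas as pd

# Toy binary attribute: 1 = property has a garden (illustrative only).
garden = pd.Series([0, 1, 0, 0, 1, 1, 0, 0])

# Count each category, then draw the bars.
counts = garden.value_counts().sort_index()
ax = counts.plot.bar()
ax.set_xlabel("garden")
ax.set_ylabel("count")
print(counts.to_dict())
```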

Question: What is the frequency of the garden attribute for north, south, east, west variables?

Frequency of adverts across the year 2018

The following bar chart shows the number of adverts in each quarter of the year. There are more houses advertised in the first and final quarters of the year. The most properties (40,000+) are advertised between October and December (201812).

Frequency of houses built across years

We can explore when the properties advertised in 2018 were built using the variable 'conyr'. This gives some indication of the development of the residential market over time.

Visualising bivariate relationships

The following sub-section visualises the relationships that exist between different variables; particularly important is the relationship between price and the property characteristics. In hedonic price modelling, a property's price is estimated based on its attributes; these include internal factors (e.g. number of bedrooms) and external ones (amenities, environment, neighbourhood) (Chau and Chin, 2003). A hedonic pricing model is often used to estimate quantitative values of the internal and external characteristics that directly affect market prices for homes, or to conduct mass appraisal (Wang and Li, 2019).

Scatter Plots

Scatter plots are commonly used to visualise the relationship between two continuous variables.

The function pairplot helps us to visualise multiple bivariate relationships in the data with one plot (info on the function).

We can see from the pairplot that many of the relationships between the variables are non-linear (a change in one variable does not correspond with a constant change in another).

Strip plots

Strip plots are a type of scatter plot suitable when one of the variables is categorical. The code chunk below shows the relationship between the quality of the housing and the construction year.

Boxplots

Boxplots are a widely used technique for visualising the relationship between a continuous and a categorical variable. If you are unsure how to interpret boxplots, check out this handy resource.

The following figure plots a boxplot for the binary housing attribute variables and unit price, using the pandas library. See the documentation.
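A minimal version with pandas on toy data; the binary column 'terrace' is a stand-in for the real attribute names:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import numpy as np
import pandas as pd

# Toy data: a continuous price and a binary attribute.
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "unit_price": rng.lognormal(8, 0.3, 200),
    "terrace": rng.integers(0, 2, 200),
})

# One box of unit price per category of the binary variable.
ax = df.boxplot(column="unit_price", by="terrace")
```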

Some key trends:

Correlation

A correlation is a statistical measure of the linear relationship between two variables. Scores range from -1 to 1: a score of 0 means no relationship, 1 means the variables are perfectly positively correlated, and -1 means they are perfectly negatively correlated (if you are unsure, see here). A correlation matrix is an excellent way to visualise the correlation between all the variables in the data set. Most of the variables have a correlation score of around 0. Some of the variables (Madrid) are strongly correlated with price, e.g. area and price have a positive correlation (0.77), and price and quality a negative one (-0.52). Correlation between explanatory variables (known as multicollinearity) can be a problem in regression.
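A sketch of computing a correlation matrix and drawing it as a heatmap, on synthetic data with a strong area-price relationship built in:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(4)
area = rng.uniform(40, 200, 300)
df = pd.DataFrame({
    "area": area,
    "price": area * 3000 + rng.normal(0, 40_000, 300),
    "rooms": rng.integers(1, 6, 300),
})

# Pairwise Pearson correlations, then a colour-coded matrix.
corr = df.corr()
ax = sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
```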

TASK

Can you make a new type of plot using code from the python graph gallery?

Section 1B: Geo-visualisations

Now we have undertaken some exploratory data analysis, we can turn our focus to the exciting bit: understanding the distribution of variables over space.

'Mapmaking, or cartography, is the visualization of geospatial data. It’s an art in that it seeks to represent data in a form that can be more easily understood and interpreted by non-technical audiences. But it’s also a science in making sure the visuals accurately conform to the data that they’re based on.' (Read more here).

In the following section we will visualise the spatial variations in housing density and attributes using:

Spatial exploration

In geodataframes, the geometry column is important; it is where we define the spatial location of points, lines and polygons (geoseries). Read more about geopandas data structures in the documentation.
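As a minimal illustration, a GeoDataFrame pairs an attribute table with a geometry column; the coordinates below are made up:

```python
import geopandas as gpd
from shapely.geometry import Point

# Two toy listings with point locations near central Madrid
# (illustrative coordinates, not real adverts).
gdf = gpd.GeoDataFrame(
    {"price": [250_000, 400_000]},
    geometry=[Point(-3.70, 40.42), Point(-3.68, 40.40)],
    crs="EPSG:4326",  # WGS84 longitude/latitude
)
print(gdf.geometry.name, gdf.crs)
```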

Geo-visualization

Hex-Binning

The Madrid map shows a lot of overlap in the points. When datasets have a large number of observations concentrated in one area, it is harder to differentiate patterns. Hexagon binning is a form of bivariate histogram which aggregates the data according to a hexagonal grid.

The underlying concept of hexagon binning:

  1. the xy plane over the set (range(x), range(y)) is tessellated by a regular grid of hexagons.
  2. the number of points falling in each hexagon are counted and stored in a data structure
  3. the hexagons with count > 0 are plotted using a color ramp or varying the radius of the hexagon in proportion to the counts. The underlying algorithm is extremely fast and effective

Lewin-Koh, 2021

Read this fab blog about using hexagon grids.

We will now visualise the density of adverts using hex-binning. To do so we will use the .plot.hexbin method from the pandas library (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.hexbin.html).
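A sketch of hex-binning with synthetic point coordinates standing in for the advert locations:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import numpy as np
import pandas as pd

# Synthetic point cloud clustered around one centre, like adverts
# concentrated in the city centre (made-up coordinates).
rng = np.random.default_rng(5)
df = pd.DataFrame({
    "lon": rng.normal(-3.70, 0.05, 2000),
    "lat": rng.normal(40.42, 0.05, 2000),
})

# Counts of points per hexagonal cell; gridsize controls resolution.
ax = df.plot.hexbin(x="lon", y="lat", gridsize=25, cmap="viridis")
```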

Now we have visualised the location of the adverts, we can take this further and look at the spatial variation in an attribute of interest using hex bins. Try changing the colours using the 'cmap' option. See this resource for a list of colours and information on how to choose a good colour scheme for your map.

Spatial KDE (heat map)

We talked about KDE briefly in the first section. We can take this technique a step further and implement it over space using seaborn's .kdeplot (https://seaborn.pydata.org/generated/seaborn.kdeplot.html).

KDE differs from hex-binning: rather than using discrete bins, the estimated surface is continuous.

Point Pattern Visualisations

We can colour the points based on the value of their attributes. This doesn't require aggregation of the data, but we may encounter the problem of overplotting. Try changing column='p_decile' to explore different variables.

Subsetting the data

The documentation for the .loc function can be found here (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html)
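A minimal .loc example on a toy table, selecting rows by a boolean condition and columns by name:

```python
import pandas as pd

df = pd.DataFrame({
    "neighbourhood": ["Centro", "Salamanca", "Centro"],
    "price": [300_000, 650_000, 280_000],
})

# Rows where the condition holds, restricted to the 'price' column.
centro = df.loc[df["neighbourhood"] == "Centro", ["price"]]
print(len(centro))  # 2
```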

Choropleth Mapping

Choropleth Maps display divided geographical areas or regions that are coloured, shaded or patterned in relation to a data variable. This provides a way to visualise values over a geographical area, which can show variation or patterns across the displayed location.

The data variable is represented by a colour progression in each region of the map. Typically this can be a blend from one colour to another, a single-hue progression, transparent to opaque, light to dark, or an entire colour spectrum.

One downside to the use of colour is that you can't accurately read or compare values from the map. Another issue is that larger regions appear more emphasised than smaller ones, so the viewer's perception of the shaded values is affected.

Check out the boundaries we will use

Create new data set which aggregates the points into areas

First option

Second option
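Both options boil down to assigning each point to the polygon that contains it and then averaging. A sketch on toy geometries, using a spatial join followed by a group-by (the real code uses the listings and the idealista polygons):

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Two square 'neighbourhoods' and three listing points (toy data).
areas = gpd.GeoDataFrame(
    {"name": ["A", "B"]},
    geometry=[Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
              Polygon([(1, 0), (2, 0), (2, 1), (1, 1)])],
)
points = gpd.GeoDataFrame(
    {"unit_price": [3000, 4000, 5000]},
    geometry=[Point(0.5, 0.5), Point(0.6, 0.4), Point(1.5, 0.5)],
)

# Attach each point to its containing polygon, then average per area.
joined = gpd.sjoin(points, areas, predicate="within")
mean_price = joined.groupby("name")["unit_price"].mean()
print(mean_price.to_dict())
```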

Plot a choropleth map of unit price

When mapping the average price on a choropleth, we split the data to create categories. There are different ways of doing this; check out this tutorial for some of them (http://darribas.org/gds15/content/labs/lab_04.html). In the example below we have used quantiles and equal intervals.
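The two schemes can be sketched with pandas alone: qcut puts equal counts in each class, while cut uses equal-width intervals. (When plotting with geopandas, the scheme= keyword does this classification for you via the mapclassify package.)

```python
import numpy as np
import pandas as pd

# Synthetic unit prices to classify.
rng = np.random.default_rng(6)
unit_price = pd.Series(rng.lognormal(8, 0.3, 100))

quantiles = pd.qcut(unit_price, 4)   # equal counts per class
equal_int = pd.cut(unit_price, 4)    # equal-width classes
print(sorted(quantiles.value_counts().tolist()))
```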

TASK

Can you map a different variable on a choropleth using the administrative boundaries instead of the idealista boundaries?

Building footprints

We have downloaded the building footprints from the Cadastre. The data is a geodataframe with a polygon geometry for each building. We can use the building footprints to visualise housing attributes. This overcomes the issue of overplotting points.

Now, we explore the cadastral dataset. We can join the cadastral data to the neighbourhood polygons, to plot the building footprint of a single area.

Section 2: Unsupervised Machine Learning (K- means clustering)

We will now move on to do some machine learning with the data. We are going to group the properties based on their attributes using an algorithm called k-means. Check out this user guide on the package. We will then be able to plot the categories to understand their spatial organisation into sub-markets.

A geodemographic analysis involves the classification of the areas that make up a geographical map into groups or categories of observations that are similar within each other but different between them. The classification is carried out using a statistical clustering algorithm that takes as input a set of attributes and returns the group (“labels” in the terminology) each observation belongs to. Depending on the particular algorithm employed, additional parameters, such as the desired number of clusters employed or more advanced tuning parameters (e.g. bandwidth, radius, etc.), also need to be entered as inputs. (Arribas-Bel, 2018).

There are many examples of geodemographic clustering in the fields of healthcare, retail, socio-economics, education and policy. For example, Singleton et al. (2020).

Although the underlying algorithm is not trivial, running K-means in Python is streamlined thanks to scikit-learn. Similar to the extensive set of available algorithms in the library, its computation is a matter of two lines of code. First, we need to specify the parameters in the KMeans method (which is part of scikit-learn’s cluster submodule) (Arribas-Bel, 2018).
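A self-contained sketch of those two lines on synthetic data, with standardisation first so no attribute dominates the distance calculation:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy feature matrix: two well-separated blobs stand in for the
# property attributes we cluster on.
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# Scale, then the two-line fit: specify parameters and run.
X_std = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_std)
print(np.bincount(labels))
```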

Creating the clusters

But what do the different clusters mean in terms of property attributes?

TASK

Add in more variables to the list of features we are clustering on. How does this change the clusters?

Mapping the clusters

We will start by mapping the clusters without any boundaries..

Let's add in some boundaries

Finally we can map the clusters with the building footprints

This gives a good indication of where we are missing data.

Evaluating the clusters

Now we have created our classification, we need to assess how good our categories are. We also need to fine-tune our parameters (the number of clusters). There are a number of ways to do this; we will be looking at silhouette scores and elbow plots.

Elbow plot

To decide on the optimum number of clusters using an elbow plot, we select the value of k at the 'elbow': the point after which adding more clusters yields only small, roughly linear decreases in the within-cluster sum of squares.
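A sketch of building an elbow plot: fit k-means for a range of k values and record the inertia (within-cluster sum of squares) of each fit.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

# Three well-separated synthetic clusters.
rng = np.random.default_rng(8)
X = np.vstack([rng.normal(c, 0.5, (40, 2)) for c in (0, 4, 8)])

# Inertia for each candidate number of clusters.
ks = range(1, 8)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("k")
plt.ylabel("inertia")
```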

Key points

Silhouette score

Silhouette analysis is used to explore the separation distance between the resulting clusters. This measure has a range of [-1, 1].
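A minimal example with scikit-learn's silhouette_score on two well-separated synthetic clusters; a score close to 1 indicates tight, well-separated groups.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two compact, well-separated blobs.
rng = np.random.default_rng(9)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
score = silhouette_score(X, labels)
print(round(score, 2))
```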

Key points

Section 2B: OPTIONAL EXERCISE

Repeat the cluster analysis for the other two cities. It would be interesting to combine the datasets and cluster across all three cities along the same criteria. Find the data in the GitHub repository!